SPECULATIVE CODE MOTION FOR MEMORY LATENCY HIDING
BACKGROUND 

[0001] Network processors (NPs) may be used for packet processing. However, the
latency of a single external memory access in a network processor may be larger than
the worst-case service time. Therefore, network processors may have a parallel
multiprocessor architecture and perform asynchronous (non-blocking) memory
access operations, so that the latency of memory accesses can be overlapped with
computation work in other threads. For instance, a network processor may process
packets in its Microengine cluster, which consists of multiple Microengines
(programmable processors with packet processing capability) running in parallel.
Every memory access instruction may be non-blocking and associated with an event
signal. That is, after a memory access instruction is issued, the instructions
following it may continue to run during the memory access. Execution may then be
blocked by a wait instruction for the associated event signal. When the associated
event signal is asserted, the wait instruction may clear the event signal and return
to execution. Consequently, all the instructions between the memory access
instruction and the wait instruction may be overlapped with the latency of the
memory access.



BRIEF DESCRIPTION OF THE DRAWINGS

[0002] The invention described herein is illustrated by way of example and not by way
of limitation in the accompanying figures. For simplicity and clarity of illustration,
elements illustrated in the figures are not necessarily drawn to scale. For example,
the dimensions of some elements may be exaggerated relative to other elements
for clarity. Further, where considered appropriate, reference labels have been
repeated among the figures to indicate corresponding or analogous elements.

[0003] FIG. 1 illustrates an embodiment of a computing device. 

[0004] FIG. 2 illustrates an embodiment of a network device. 

[0005] FIG. 3 illustrates an embodiment of a method that may be used for memory
latency hiding.

[0006] FIGS. 4A-4C each illustrate an embodiment of a representation of a program
that comprises a memory access instruction.

[0007] FIGS. 5A and 5B are time sequence diagrams that each illustrate an
embodiment of a latency of a memory access instruction.

[0008] FIGS. 6A and 6B each illustrate an embodiment of a representation of a
compiler to enforce the dependence for a memory access instruction.

[0009] FIGS. 7A-7C illustrate an embodiment of a speculative code motion for a wait
instruction.

[0010] FIG. 8 illustrates an embodiment of a compiler.

DETAILED DESCRIPTION

[0011] The following description describes techniques to hide memory access latency.
The implementation of the techniques is not restricted to network processors; it may
be used by any execution environment for similar purposes. In the following
description, numerous specific details such as logic implementations, opcodes,
means to specify operands, resource partitioning/sharing/duplication
implementations, types and interrelationships of system components, and logic
partitioning/integration choices are set forth in order to provide a more thorough
understanding of the present invention. However, the invention may be practiced
without such specific details. In other instances, control structures and full software
instruction sequences have not been shown in detail in order not to obscure the
invention.

[0012] References in the specification to "one embodiment", "an embodiment", "an
example embodiment", etc., indicate that the embodiment described may include a
particular feature, structure, or characteristic, but every embodiment may not
necessarily include the particular feature, structure, or characteristic. Moreover,
such phrases are not necessarily referring to the same embodiment. Further, when
a particular feature, structure, or characteristic is described in connection with an
embodiment, it is submitted that it is within the knowledge of one skilled in the art to
effect such feature, structure, or characteristic in connection with other
embodiments whether or not explicitly described.

[0013] Embodiments of the invention may be implemented in hardware, firmware,
software, or any combination thereof. Embodiments of the invention may also be
implemented as instructions stored on a machine-readable medium, which may be
read and executed by one or more processors. A machine-readable medium may
include any mechanism for storing or transmitting information in a form readable by
a machine (e.g., a computing device). For example, a machine-readable medium
may include read only memory (ROM); random access memory (RAM); magnetic
disk storage media; optical storage media; flash memory devices; electrical, optical,
acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals,
digital signals, etc.), and others.

[0014] An example embodiment of a computing device 100 is shown in FIG. 1. The
computing device 100 may comprise one or more processors 110 coupled to a
chipset 120. The chipset 120 may comprise one or more integrated circuit
packages or chips that couple the processor 110 to system memory 130, storage
device 150, and one or more I/O devices 160 such as, for example, mouse,
keyboard, video controller, etc. of the computing device 100.

[0015] Each processor 110 may be implemented as a single integrated circuit, multiple
integrated circuits, or hardware with software routines (e.g., binary translation
routines). The processor 110 may perform actions in response to executing
instructions. For example, the processor 110 may execute programs and perform
data manipulations and control tasks in the computing device 100. The processor
110 may be any type of processor adapted to execute instructions from memory
130, such as a microprocessor, a digital signal processor, a microcontroller, or
another processor.

[0016] The memory 130 may comprise one or more different types of memory devices
such as, for example, dynamic random access memory (DRAM) devices, static
random access memory (SRAM) devices, read-only memory (ROM) devices,
and/or other volatile or non-volatile memory devices. The memory 130 may store
instructions and codes represented by data signals that may be executed by
processor 110. In one embodiment, a compiler 140 may be stored in the memory
130 and implemented by the processor 110. The compiler 140 may comprise any
type of compiler adapted to generate data, code, information, etc., that may be
stored in memory 130 and accessed by processor 110.

[0017] The chipset 120 may comprise a memory controller 180 that may control access
to the memory 130. The chipset 120 may further comprise a storage device
interface (not shown) that may access the storage device 150. The storage device
150 may comprise a tape, a hard disk drive, a floppy diskette, a compact disk
read-only memory (CD-ROM), a flash memory device, other mass storage devices, or
any other magnetic or optical storage media. The storage device 150 may store
information, such as code, programs, files, data, applications, and operating
systems. The chipset 120 may further comprise one or more I/O interfaces (not
shown) to access the I/O device 160 via buses 112 such as, for example, peripheral
component interconnect (PCI) buses, accelerated graphics port (AGP) buses,
universal serial bus (USB) buses, low pin count (LPC) buses, and/or other I/O buses.

[0018] The I/O device 160 may include any I/O devices to perform I/O functions.
Examples of the I/O device 160 may include controllers for input devices (e.g.,
keyboard, mouse, trackball, pointing device), media cards (e.g., audio, video,
graphics), network cards, and any other peripheral controllers.

[0019] An embodiment of a network device 200 is shown in FIG. 2. The network device
200 may enable transfer of packets between a client and a server via a network.
The network device 200 may comprise a network interface 210, a network
processor 220, and a memory 250. The network interface 210 may provide
physical, electrical, and protocol interfaces to transfer packets. For example, the
network interface 210 may receive a packet and send the packet to the network
processor 220 for further processing.

[0020] The memory 250 may store one or more packets and packet related information
that may be used by the network processor 220 to process the packets. In one
embodiment, the memory 250 may store packets, look-up tables, and data structures
that enable the network processor 220 to process the packets. In one embodiment,
the memory 250 may comprise a dynamic random access memory (DRAM) and a
static random access memory (SRAM).

[0021] The network processor 220 may receive one or more packets from the network
interface 210, process the packets, and send the packets to the network interface
210. In one embodiment, the network processor 220 may comprise, for example, an
Intel® IXP2400 network processor. The network processor 220 may comprise a
memory controller 240 that may control access to memory 250. For example, the
network processor 220 may perform asynchronous or non-blocking memory access
operations on memory 250 under control of the memory controller 240. In one
embodiment, the memory controller 240 may be located outside the network
processor 220.

[0022] The network processor 220 may further comprise microengines 230-1 through
230-N that may run in parallel. The microengines 230-1 through 230-N may
cooperatively operate to process the packets. Each microengine may process a
portion of the packet processing task. The processing of a packet may comprise
sub-tasks such as packet validation, IP lookup, and determining the type of service
(TOS), time to live (TTL), outgoing address, and the MAC address. In one
embodiment, each microengine may comprise one or more threads and each
thread may perform a sub-task. For example, the microengine 230-1 may comprise
threads such as 231-0 to 231-3. However, other embodiments may comprise a
different number of threads such as, for example, eight threads. Each microengine
may comprise a signal register file and a pseudo register file. For example, the
microengine 230-1 may comprise a signal register file 234 and a pseudo register file
236. The signal register file 234 may comprise one or more registers that each may
store an asynchronous signal corresponding to a memory access instruction. The
pseudo register file 236 may comprise one or more registers that each may store a
pseudo signal.

[0023] In the following, an example embodiment of a process as shown in FIG. 3 will be
described in combination with FIGs. 4-7. In block 302, the compiler 140 may extract
from an I/O instruction or memory access instruction an asynchronous signal that
may represent latency associated with the I/O instruction. In one embodiment, the
I/O instruction or memory access instruction may comprise a store instruction, a
load instruction, etc. For a program 400 as shown in FIG. 4A, the compiler 140 may
extract an asynchronous signal s from a store instruction 412. After the extraction,
the compiler 140 may represent the store instruction 412 as a store instruction 422
(FIG. 4B) associated with the asynchronous signal s. The compiler 140 may further
generate a wait instruction 424 that waits for the asynchronous signal s. In one
embodiment, the signal register file 234 may comprise a signal register to store the
asynchronous signal s. As shown in FIG. 4C, the asynchronous signal s may
represent a dependence or dependence constraint, for example, between the store
instruction 422 and the wait instruction 424 explicitly in the compiler 140, so that
optimization of the latency, as well as other optimizations, may continue to work on
the dependence of the program 400.
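
The extraction of block 302 may be pictured with a minimal Python sketch. The `Instr` record, the `extract_signals` function, and the signal names `s1`, `s2`, ... are assumptions made for illustration only; they are not the internal representation of the compiler 140.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Instr:
    op: str                                   # e.g. "store", "load", "add", "wait"
    text: str                                 # human-readable form
    defs: set = field(default_factory=set)    # names this instruction defines
    uses: set = field(default_factory=set)    # names this instruction uses


def extract_signals(program):
    """Attach a fresh signal to each load/store and emit a wait right after it."""
    out, ids = [], count(1)
    for ins in program:
        if ins.op in ("load", "store"):
            s = f"s{next(ids)}"               # hypothetical signal name
            out.append(Instr(ins.op, f"{ins.text}, signal {s}",
                             ins.defs | {s}, set(ins.uses)))
            out.append(Instr("wait", f"wait {s}", set(), {s}))
        else:
            out.append(ins)
    return out


program = [
    Instr("add", "R2 = R3 + 1", {"R2"}, {"R3"}),
    Instr("store", "[R1] = store R2", set(), {"R1", "R2"}),
    Instr("add", "R4 = R5 + 1", {"R4"}, {"R5"}),
]
lowered = extract_signals(program)
for ins in lowered:
    print(ins.text)
```

In this sketch the wait is placed immediately after the memory access; later code motion is what separates the two.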
[0024] In order to enforce the dependence between a memory access instruction and a
wait instruction, the compiler 140 may use a relationship of define and use. For
example, referring to an internal representation of the compiler 140 as shown in FIG.
6A, the compiler 140 may make a store instruction 612 define an asynchronous
signal extracted from the store instruction 612. The compiler 140 may further make
a wait instruction 614 wait for or use the asynchronous signal. Similarly, as shown in
FIG. 6B, the compiler 140 may make a load instruction 622 define an asynchronous
signal extracted from the load instruction 622 and may make a wait instruction 624
wait for or use the asynchronous signal. In one embodiment, a signal register to
store the asynchronous signal may be introduced in a network device, for example,
as shown in FIG. 2. For example, the signal register file 234 of FIG. 2 may comprise
one or more signal registers that may each store an asynchronous signal extracted
from a memory access instruction.

[0025] Referring to FIGS. 5A and 5B, embodiments of instructions that may have
dependence on associated wait instructions are illustrated. As shown in FIG. 5A,
after issue of a load instruction, R1=load [R2], signal s (512), an instruction,
R2=R1+1, that uses a result R1 of the load instruction has to wait for the completion
of the load operation (516), i.e., until the asynchronous signal of the load instruction
is asserted and the result R1 is ready (514). The instruction may not be overlapped
with latency A of the load instruction. Similarly, FIG. 5B illustrates a similar situation
for an instruction, R2=R3+1, that overwrites a source R2 of a store instruction,
[R1]=store R2, signal.

[0026] In order to enforce a dependence between an instruction that depends on the
completion of a memory access instruction and a wait instruction associated with
the memory access instruction, the compiler 140 may also employ a relationship of
define and use. In one embodiment, the compiler 140 may introduce a pseudo
signal register for each signal register for a memory access instruction. Referring to
the internal representation as shown in FIG. 6A, the compiler 140 may make the
wait instruction 614 define a pseudo signal corresponding to the asynchronous
signal extracted from the store instruction 612. The compiler 140 may further make
an instruction 616 that depends on the completion of the store instruction 612 use
the pseudo signal. Similarly, FIG. 6B illustrates another embodiment relating to a
load instruction. In one embodiment, the pseudo register file 236 of FIG. 2 may
comprise one or more pseudo signal registers that may each store a pseudo signal.
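
As an illustration of why the signal and the pseudo signal let an ordinary define-use analysis carry both dependences, the following minimal Python sketch computes define-use edges over a three-instruction sequence modeled on FIG. 6A. The tuple layout and the function name `def_use_edges` are assumptions made for this example.

```python
def def_use_edges(program):
    """Return (definer_index, user_index, name) for each use reached by an
    earlier definition in a straight-line instruction list."""
    last_def, edges = {}, []
    for idx, (_text, defs, uses) in enumerate(program):
        for name in sorted(uses):
            if name in last_def:
                edges.append((last_def[name], idx, name))
        for name in defs:
            last_def[name] = idx
    return edges


# (text, names defined, names used); s is the signal, p the pseudo signal.
program = [
    ("[R1] = store R2, signal s", {"s"},  {"R1", "R2"}),   # cf. store 612
    ("wait s",                    {"p"},  {"s"}),          # cf. wait 614
    ("R2 = R3 + 1",               {"R2"}, {"R3", "p"}),    # cf. instruction 616
]
edges = def_use_edges(program)
print(edges)
```

Without the pseudo signal p, the third instruction would have no define-use edge to the wait at all, because it only overwrites R2; with p, the ordering store, wait, overwrite is enforced by the two edges on s and p.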
[0027] In one embodiment, the compiler 140 may map a register number to a signal
register and a pseudo register. An example of the codes may be as follows, wherein
V2SP may represent the corresponding signal register and pseudo register:

    Map of int to pair: V2SP

[0028] In another embodiment, the compiler 140 may use the following codes to
express the define-use relation as DU and organize the define-use relation DU as
webs, wherein R is the number of register accesses:

    Relation of R to R: DU;
    Partition Set of R: webs;

[0029] The following codes may be used by the compiler 140 to extract asynchronous
signals and introduce pseudo signals to enforce the dependence for load
instructions in a program. For example, the compiler 140 may execute the following
operation:

    Build def-use relation for the registers defined in
    all load instructions in the program

[0030] Then, the compiler 140 may construct webs based on the DU relation, wherein r1
and r2 may represent the two elements of a pair in DU, i.e., a define and a use. For
example:

    For each pair <r1, r2> in DU
        Join r1 and r2 in webs;
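
Joining every define-use pair into a partition is a classic union-find (disjoint-set) task. The following minimal Python sketch shows one way the webs might be built; the function name `build_webs` and the string labels for register accesses are hypothetical.

```python
def build_webs(du_pairs):
    """Partition register accesses into webs by joining each <r1, r2> in DU."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for r1, r2 in du_pairs:
        a, b = find(r1), find(r2)
        if a != b:
            parent[b] = a                   # join the two partitions

    webs = {}
    for x in list(parent):
        webs.setdefault(find(x), set()).add(x)
    # Sort only to make the result deterministic for printing.
    return sorted(webs.values(), key=lambda w: sorted(w))


du = [("def1", "use1"), ("def1", "use2"), ("def2", "use2"), ("def3", "use3")]
webs = build_webs(du)
print(webs)
```

Here def1 and def2 end up in the same web because both reach use2, which is the property that lets a single signal and pseudo signal pair be assigned per web.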

[0031] In one embodiment, for each partition w in the webs, the compiler 140 may
map the register number v of w to a pair of signal s and pseudo signal p, obtaining
s and p from the corresponding signal register and pseudo register V2SP[v]. For
each occurrence r in the partition w, where i is the instruction that contains r, the
compiler 140 may determine whether the register number v is defined in the
instruction i. If so, the compiler 140 may make the instruction i define s explicitly
and, in response to determining that the instruction i is a load instruction, may
generate a wait instruction to wait for signal s and make the wait instruction define p
and use s explicitly. Otherwise, in response to determining that v is used in the
instruction i, the compiler 140 may make the instruction i use p explicitly. An
example of the corresponding codes may be as follows:



    For each partition w in the webs
    {
        /* w's register number is v */
        <s, p> = V2SP[v];  /* map v to signal s and pseudo p */
        for each occurrence r of w
        {
            /* i is the container instruction */
            if (v is defined here)
            {
                Make i define s explicitly;
                if (i is a load instruction)
                    Generate an instruction to wait signal s
                    and make the wait instruction define p and use s
                    explicitly;
            } else /* v is used here */
            {
                Make i use p explicitly;
            }
        }
    }
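
The per-web annotation step for load instructions may be sketched in Python as follows. This is an illustrative reading only: the dictionary-based instruction records, the representation of a web as a list of (instruction index, "def" or "use") occurrences, and the function name `annotate_load_web` are all assumptions.

```python
def annotate_load_web(program, web, s, p):
    """Apply the per-web step for loads and return a new instruction list.

    program: list of dicts with keys "op", "defs", "uses"
    web:     list of (instruction_index, "def" | "use") occurrences of v
    s, p:    the signal and pseudo signal that V2SP maps v to
    """
    kinds = {}
    for idx, kind in web:
        kinds.setdefault(idx, set()).add(kind)

    out = []
    for idx, ins in enumerate(program):
        ins = {"op": ins["op"], "defs": set(ins["defs"]), "uses": set(ins["uses"])}
        k = kinds.get(idx, set())
        if "use" in k:
            ins["uses"].add(p)              # make i use p explicitly
        if "def" in k:
            ins["defs"].add(s)              # make i define s explicitly
        out.append(ins)
        if "def" in k and ins["op"] == "load":
            # generated wait: defines p and uses s
            out.append({"op": "wait", "defs": {p}, "uses": {s}})
    return out


program = [
    {"op": "load", "defs": {"R1"}, "uses": {"R2"}},   # R1 = load [R2]
    {"op": "add",  "defs": {"R3"}, "uses": {"R1"}},   # R3 = R1 + 1
]
annotated = annotate_load_web(program, [(0, "def"), (1, "use")], "s", "p")
for ins in annotated:
    print(ins["op"], sorted(ins["defs"]), sorted(ins["uses"]))
```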
[0032] Similarly, an example algorithm is shown as follows for the compiler 140 to
extract asynchronous signals and introduce pseudo signals to enforce the
dependence for store instructions, wherein the compiler 140 may use a use-define
relation UD.

    Build use-define relation for the registers used in all
    store instructions in the program;
    /* Construct webs based on UD relation */
    For each pair <r1, r2> in UD
        Join r1 and r2 in webs;
    For each partition w in the webs
    {
        /* w's register number is v */
        <s, p> = V2SP[v];  /* map v to signal s and pseudo p */
        for each occurrence r of w
        {
            /* i is the container instruction */
            if (v is used here)
            {
                Make i define s explicitly;
                if (i is a store instruction)
                    Generate an instruction to wait signal s
                    and make the wait instruction define p and use s
                    explicitly;
            } else /* v is defined here */
            {
                Make i use p explicitly;
            }
        }
    }

[0033] In order to schedule as many instructions as possible between issue of a
memory access operation and its completion, the compiler 140 may perform code
motion subject to the dependence constraint or order/relationship of instructions
defined in a program. In block 304, the compiler 140 may further perform a first
stage of code motion. For example, the compiler 140 may recognize a first set of
one or more instructions in a program except I/O instructions as motion candidates
and move the candidates forward subject to the dependence constraint of the
program through one or more paths in the flow graph of the program. In one
embodiment, the first stage of code motion may comprise a code sinking operation.
For example, for the program as shown in FIG. 4B, the compiler 140 may recognize
instructions 426 and the wait instruction 424 as motion candidates and may move
these instructions forward while fixing the location of memory access instruction
422, so that a number of instructions between the issue and the completion of the
memory access instruction 422 may be increased subject to a dependence
constraint in the program.
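
Within a single basic block, the effect of sinking a wait instruction may be sketched as below. The dependence test is a simplified assumption (flow, anti, and output dependences on named registers and signals only), and the helper names `ins`, `independent`, and `sink` are hypothetical.

```python
def ins(text, defs=(), uses=()):
    """Build a tiny instruction record (assumed representation)."""
    return {"text": text, "defs": set(defs), "uses": set(uses)}


def independent(a, b):
    """True if a and b may be reordered: no flow, anti, or output dependence."""
    return not (a["defs"] & b["uses"] or a["uses"] & b["defs"] or a["defs"] & b["defs"])


def sink(block, index):
    """Move block[index] later while the next instruction is independent of it."""
    block = list(block)
    while index + 1 < len(block) and independent(block[index], block[index + 1]):
        block[index], block[index + 1] = block[index + 1], block[index]
        index += 1
    return block


block = [
    ins("[R1] = store R2, signal s", ["s"], ["R1", "R2"]),  # stays fixed
    ins("wait s", ["p"], ["s"]),                            # motion candidate
    ins("R4 = R5 + 1", ["R4"], ["R5"]),
    ins("R6 = R4 + R7", ["R6"], ["R4", "R7"]),
    ins("R2 = R3 + 1", ["R2"], ["R3", "p"]),                # overwrites the store source
]
sunk = sink(block, 1)
for i in sunk:
    print(i["text"])
```

The wait stops just before the instruction that uses the pseudo signal p, so two independent instructions now execute between the issue of the store and its wait.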

[0034] In one embodiment, in the first stage of code motion, compiler 140 may further
adopt speculative code motion for wait instructions, for example, in a situation as
shown in FIG. 7A. FIG. 7A illustrates an embodiment of a flow graph, wherein block
710 is a merging predecessor block for blocks 720 and 730 that are two
predecessor blocks for block 740; and block 740 is a merging successor block of
blocks 720 and 730; however, other embodiments may comprise a different
structure. The first predecessor block 720 may comprise a wait instruction 724
associated with a memory access instruction 722. The second predecessor block
730 does not comprise a wait instruction. In this situation, the compiler 140 may not
move forward or sink the wait instruction 724 into the merging successor block 740
of the blocks 720 and 730 even if the dependence constraint of the flow graph
allows.

[0035] In order to perform the first stage of code motion speculatively, for example, in
the situation as shown in FIG. 7A, in one embodiment, the compiler 140 may further
insert or append one or more compensation codes to the second predecessor block
730. Referring to FIG. 7B, the compiler 140 may insert a signal sending instruction
734 and a second wait instruction 736 into the second predecessor block 730. The
signal sending instruction 734 may send the asynchronous signal of the memory
access instruction 722 to the second predecessor block 730 subject to the
dependence constraint of the flow graph. The compiler 140 may generate a second
wait instruction 736 that waits for the asynchronous signal in block 730. Then, as
shown in FIG. 7C, the compiler 140 may remove the two wait instructions 724 and
736 from blocks 720 and 730, respectively. The compiler 140 may further prepend
an instruction instance 742 of wait instructions 724 and 736 to the merging
successor block 740 while fixing the memory access instruction 722 subject to the
dependence constraint of the flow graph, so that a number of instructions between
the issue and the completion of the memory access instruction 722 may be
increased.
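
The transformation from FIG. 7A through FIG. 7B to FIG. 7C may be sketched in Python on a diamond-shaped flow graph. Instructions are plain strings, dependence checking is omitted, and the `send_signal` mnemonic, the block names, and the function name are assumptions made for illustration.

```python
def speculative_sink_wait(blocks, preds, merge, wait):
    """Sink `wait` into block `merge`, compensating predecessors that lack it."""
    blocks = {name: list(body) for name, body in blocks.items()}  # work on a copy
    signal = wait.split()[1]
    # FIG. 7B: append compensation code where the wait is missing.
    for p in preds[merge]:
        if wait not in blocks[p]:
            blocks[p] += [f"send_signal {signal}", wait]
    # FIG. 7C: remove the wait from every predecessor, keep one copy in the successor.
    for p in preds[merge]:
        blocks[p].remove(wait)
    blocks[merge].insert(0, wait)
    return blocks


blocks = {
    "B710": ["branch"],
    "B720": ["[R1] = store R2, signal s", "wait s"],
    "B730": ["R4 = R5 + 1"],
    "B740": ["R6 = R4 + 1"],
}
preds = {"B740": ["B720", "B730"]}
result = speculative_sink_wait(blocks, preds, "B740", "wait s")
for name, body in result.items():
    print(name, body)
```

After the motion, the path through the second predecessor raises the signal itself, so the single wait at the head of the merging successor is satisfied on both paths.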

[0036] In block 306, the compiler 140 may perform a second stage of code motion. In
one embodiment, the compiler 140 may recognize a second set of one or more
instructions in the program except wait instructions as motion candidates and move
the candidates backward subject to the dependence constraint through the paths in
the program. In one embodiment, the second stage of code motion may comprise a
code hoisting operation. In another embodiment, for a motion candidate that
depends on the completion of a memory access instruction, the compiler 140 may
move the candidate backward to follow a wait instruction associated with the
memory access instruction so as to accord with the dependence constraint
between the candidate and the wait instruction. In one embodiment, the second
instruction set may comprise one or more instructions that are comprised in the first
instruction set.

[0037] In one embodiment, the compiler 140 may perform a code sinking operation with
I/O instructions fixed and a code hoisting operation with wait instructions fixed;
however, in another embodiment, the compiler 140 may perform in a program, for
example, a code hoisting operation with wait instructions fixed and a code sinking
operation with I/O instructions fixed subject to the dependence constraint of the
program.
[0038] In the following, a description will be made on an example of codes that may be
used by the compiler 140 to perform speculative code motion for wait instructions.
In one embodiment, the compiler 140 may use the following codes to map an
instruction i in a program to a motion candidate c (NC):

    Map of int to int: NC;

[0039] The compiler 140 may map two instructions in a program to the same motion
candidate NC, in response to determining that the two instructions are syntactically
the same.
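
A minimal Python sketch of such a map is given below. It assumes, purely for illustration, that "syntactically the same" means equal instruction text after whitespace normalization; the function name `build_nc` is hypothetical.

```python
def build_nc(instructions):
    """Map instruction index i to a motion candidate number NC[i].

    Instructions with identical (whitespace-normalized) text share a candidate.
    """
    ids, nc = {}, {}
    for i, text in enumerate(instructions):
        key = " ".join(text.split())
        nc[i] = ids.setdefault(key, len(ids))   # reuse or allocate a candidate number
    return nc


nc = build_nc(["wait s", "R1 = R2 + 1", "wait  s"])
print(nc)
```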

[0040] In another embodiment, the compiler 140 may use the following code to
represent a number of occurrences of an instruction in predecessor blocks as
SinkCandidates, wherein the instruction is ready to sink into a successor block of
the predecessor blocks subject to a dependence constraint:

    Vector of int: SinkCandidates;

[0041] In one embodiment, the compiler 140 may build a map of motion candidates NC
in a program, build a flow graph G for the program, and initialize a work queue
SinkQueue with basic blocks based on a topological order in the flow graph G. For
example:

    Build the map of NC that maps an instruction i to motion candidate c;
    Build the flow graph G for the program;
    Initialize a WorkQueue (SinkQueue) with basic blocks
    based on the topological order in graph G

[0042] The compiler 140 may determine whether the work queue is empty or not. In
response to determining that the work queue comprises at least one basic block,
the compiler 140 may dequeue a basic block b from the SinkQueue. The compiler
140 may further build a set for all predecessor blocks of the basic block b as
Predecessors. For each predecessor block p, the compiler 140 may put each
instruction i in p into a set of Ready, in response to determining that i is ready to sink
into the basic block b subject to the dependence constraint of the program.

    While (SinkQueue is not empty)
    {
        Dequeue a basic block b from SinkQueue;
        Build a set Predecessors for all predecessors of b;
        For each basic block p in Predecessors
            For each instruction i in basic block p
                if i is ready to sink into basic block b subject to dependence constraint
                    Put i into the set of Ready;



[0043] In response to determining that the SinkQueue is not empty, the compiler 140
may further determine whether the set of Ready is empty. In response to
determining that Ready comprises at least one instruction, i.e., is not empty, the
compiler 140 may further reset the number of ready instructions recorded in
SinkCandidates for each motion candidate. For example:

        while (Ready is not empty)
        {
            /* reset the number of the ready instructions for each motion candidate to 0 */
            Reset SinkCandidates;

[0044] For each instruction i in Ready, the compiler 140 may record or calculate
SinkCandidates[NC[i]], i.e., a number of occurrences of the instruction i in different
predecessors of the basic block b. For each instruction i in Ready, in response to
determining that the number SinkCandidates[NC[i]] is less than the number of
predecessor blocks of b, the compiler 140 may further determine whether the
current candidate is a wait instruction. In response to determining that the current
candidate is a wait instruction, the compiler 140 may append compensation code to
the predecessor blocks where the current candidate is not ready, such as in the
situation shown in FIG. 7A, and make SinkCandidates[NC[i]] equal to the number of
predecessor blocks of b (FIG. 7B). On the other hand, for each instruction i in
Ready, in response to determining that SinkCandidates[NC[i]] equals the number
of predecessor blocks of b, the compiler 140 may remove all instructions
corresponding to the current candidate from all predecessor blocks of b and
prepend an instruction instance of the candidate to b (FIG. 7C). The compiler 140
may further update the dependence constraint relating to all predecessor blocks
and may update the set of Ready. An example of codes may be as follows:

            For each instruction i in Ready
            {
                SinkCandidates[NC[i]]++;
            }
            For each instruction i in Ready
            {
                if (SinkCandidates[NC[i]] < the number of predecessors of b)
                {
                    if (the current candidate which NC[i] indicates is a WAIT instruction)
                    {
                        Append the compensation code to the blocks in
                        Predecessors where this candidate is not ready;
                        SinkCandidates[NC[i]] = the number of predecessors of b;
                    }
                }
                if (SinkCandidates[NC[i]] == the number of predecessors of b)
                {
                    Remove all the instructions corresponding to this candidate
                    from all the predecessors of b;
                    Prepend an instruction instance of this candidate to basic block b;
                    Update the dependence constraint of all predecessor blocks;
                    Update the Ready set; /* May introduce more ready instructions */
                }
            }
        }
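
The counting step may be illustrated with a short Python sketch. It assumes that each ready instruction appears at most once per predecessor block, and the names `classify_ready` and `is_wait` are hypothetical; the sketch only decides what would be done for each candidate and does not rewrite any blocks.

```python
from collections import Counter


def classify_ready(ready, nc, num_preds, is_wait):
    """Count ready occurrences per candidate and decide what to do with each.

    Returns (sink_now, compensate): candidates ready in every predecessor, and
    wait candidates ready in only some predecessors (needing compensation code).
    """
    counts = Counter(nc[i] for i in ready)      # plays the role of SinkCandidates
    sink_now = sorted(c for c, n in counts.items() if n == num_preds)
    compensate = sorted(c for c, n in counts.items()
                        if n < num_preds and is_wait(c))
    return sink_now, compensate


# Instructions 10 and 20 are the same candidate 0 in two predecessors;
# instruction 11 is a wait (candidate 1) ready in one predecessor only;
# instruction 12 is a non-wait (candidate 2) ready in one predecessor only.
nc = {10: 0, 11: 1, 12: 2, 20: 0}
sink_now, compensate = classify_ready([10, 11, 12, 20], nc, 2, lambda c: c == 1)
print(sink_now, compensate)
```

Candidate 2 appears in neither list: a non-wait instruction that is ready in only some predecessors is left in place.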



[0045] In another embodiment, the compiler 140 may further enqueue successor
blocks of the current block b of G in SinkQueue, in response to any change when
the SinkQueue is not empty. For example:

        If any change happens, enqueue the current block's
        successors in G in SinkQueue
    }

[0046] FIG. 8 is a block diagram that illustrates a compiler 800 according to an
embodiment of the present invention. The compiler 800 may comprise a compiler
manager 810. The compiler manager 810 may receive source code to compile.
The compiler manager 810 may interface with and transmit information between
other components in the compiler 800.

[0047] The compiler 800 may comprise a front end unit 820. In one embodiment, the
front end unit 820 may parse source code. An intermediate language unit 830 in
the compiler 800 may transform the parsed source code from the front end unit
820 into one or more common intermediate forms, such as an intermediate
representation. For example, referring to FIGs. 6A and 6B, the intermediate
language unit 830 may extract an asynchronous signal from a memory access
instruction and may define the asynchronous signal in the memory access
instruction explicitly. The intermediate language unit 830 may further make a wait
instruction associated with the memory access instruction wait for or use the
asynchronous signal and may define a pseudo signal associated with the
asynchronous signal, with reference to 614 and 624 in FIGs. 6A and 6B. The
intermediate language unit 830 may make a memory access dependent instruction,
such as 616 and 626 in FIGs. 6A and 6B, use the pseudo signal. In one
embodiment, a network device, for example, as shown in FIG. 2, may comprise a
signal register and a pseudo register to store the asynchronous signal and the
pseudo signal, respectively.

[0048] In one embodiment, the compiler 800 may further comprise a code motion unit
840. The code motion unit 840 may perform, for example, global instruction
scheduling. In one embodiment, the code motion unit 840 may perform code motion
as described in blocks 304 and 306 of FIG. 3. For example, the code motion unit
840 may move an instruction from a predecessor block into a successor block
subject to a dependence constraint of a program, for example, as shown in FIG. 7C.
In another embodiment, the code motion unit 840 may move an instruction from a
successor block into a predecessor block subject to a dependence constraint of a
program. In yet another embodiment, the code motion unit 840 may perform
speculative code motion for wait instructions subject to a dependence constraint of
a program, for example, as shown in FIGs. 7B and 7C. In one embodiment, the
compiler 800 may comprise one or more code motion units 840 that may each
perform one code motion, such as, for example, a first code motion, a second code
motion and/or a speculative code motion. In another embodiment, the compiler 800
may comprise other code motion units for different code motions.

[0049] The compiler 800 may further comprise a code generator 850 that may convert 
the intermediate representation into machine or assembly code. 

[0050] While certain features of the invention have been described with reference to
embodiments, the description is not intended to be construed in a limiting sense.
Various modifications of the embodiments, as well as other embodiments of the
invention, which are apparent to persons skilled in the art to which the invention
pertains are deemed to lie within the spirit and scope of the invention.


What is claimed is:

1. A method comprising: 

extracting an asynchronous signal from a memory access instruction in a 
program to represent a latency of the memory access instruction; and 
generating a wait instruction to wait for the asynchronous signal. 

2. The method of claim 1, further comprising:

enforcing a first dependence between the memory access instruction and
the wait instruction via the asynchronous signal.

3. The method of claim 1, further comprising:

introducing a pseudo signal to enforce a second dependence between the
wait instruction and a memory access dependent instruction.

4. The method of claim 1, further comprising:

making the memory access instruction define the asynchronous signal; and
making the wait instruction use the asynchronous signal.

5. The method of claim 1, further comprising:

making the wait instruction define a pseudo signal; and
making an instruction that depends on the completion of the memory access
instruction use the pseudo signal.

6. The method of claim 1, further comprising:

storing the asynchronous signal in a signal register of a network device.

7. The method of claim 3, further comprising:

storing the pseudo signal in a pseudo signal register of a network device.

8. A method comprising, subject to a dependence constraint of a program:

performing a first code motion on a first set of one or more instructions
except each memory access instruction in the program; and

performing a second code motion on a second set of one or more
instructions except each wait instruction in the program, to increase a number of
instructions between issue and completion of the memory access instruction.

9. The method of claim 8, wherein the first code motion comprises moving 
the first instruction set forward through one or more paths of the program with the 
memory access instructions fixed, and the second code motion comprises moving 
the second instruction set backward through the one or more paths of the program
with the wait instructions fixed. 



10. The method of claim 8, wherein the first code motion comprises sinking
the one or more instructions in the first set that occur in each predecessor block of a
successor block into the successor block, and the second code motion comprises
hoisting the one or more instructions in the second set.

11. The method of claim 8, comprising:

performing a speculative code motion on a wait instruction, in response to
determining that the wait instruction is absent in at least one predecessor block of
a successor block.



12. The method of claim 8, comprising: 

in response to determining that the number of occurrences of a wait
instruction in predecessor blocks of a successor block is less than the number of
the predecessor blocks, appending a compensation code for the wait instruction to
one or more predecessors that lack the wait instruction;

removing the wait instruction from the predecessors; and

prepending an instruction instance of the wait instruction to the successor
block.

13. A compiler, comprising:

a code motion unit to perform code motion in a program subject to a
dependence constraint of the program to hide a latency of a memory access
instruction in the program.

14. The compiler of claim 13, further comprising:

an intermediate language unit to represent a memory access instruction in a
program with an asynchronous signal associated with a latency of the memory
access instruction.

15. The compiler of claim 13, further comprising:

an intermediate language unit to define an asynchronous signal in the
memory access instruction to represent the latency and to generate a wait
instruction that uses the asynchronous signal.

16. The compiler of claim 13, further comprising: 

an intermediate language unit to define a pseudo signal in a wait instruction 
associated with the memory access instruction and to make an instruction that 
depends on the memory access instruction use the pseudo signal. 

17. The compiler of claim 13, wherein the code motion unit is further to

move a wait instruction associated with the memory access instruction and a
first set of one or more instructions in a first direction subject to the dependence
constraint, with the memory access instruction fixed; and

move the memory access instruction and a second set of one or more
instructions in the program subject to the dependence constraint in a second direction
that is opposite to the first direction, with the wait instruction fixed.

18. The compiler of claim 13, wherein the code motion unit is further to

sink a wait instruction associated with the memory access instruction and a
first set of one or more instructions of the program from each predecessor block to a
successor block at a merging point of the predecessor blocks subject to the
dependence constraint of the program, in response to determining that each
predecessor block comprises the wait instruction and the one or more instructions,
with the memory access instruction fixed; and

hoist the memory access instruction and a second set of one or more
instructions in the program subject to the dependence constraint, with the wait
instruction fixed.

19. The compiler of claim 13, wherein the code motion unit is further to

perform a speculative code motion on a wait instruction associated with the
memory access instruction, in response to determining that the wait instruction is
present in a first predecessor block of a merging successor block of the program
and is absent in a second predecessor block of the merging successor block.

20. The compiler of claim 13, wherein the code motion unit is further to

recognize a wait instruction associated with the memory access instruction
as a motion candidate subject to a dependence constraint of the program;

in response to determining that the wait instruction is present in a first
predecessor block of a merging successor block and is absent in a second
predecessor block of the merging successor block, insert a compensation code for
the wait instruction into the second predecessor block; and

sink the wait instruction into the merging successor block of the first and
second predecessor blocks subject to the dependence constraint.

21. The compiler of claim 20, wherein the code motion unit is further to
hoist the memory access instruction subject to the dependence constraint.

22. A machine readable medium comprising a plurality of instructions that in
response to being executed result in a computing device

determining a motion candidate from one or more predecessor blocks of a 
first block of a program based on a dependence constraint of the program; and 

performing a code motion on an instruction corresponding to the motion 
candidate to hide a latency associated with a memory access instruction. 

23. The machine readable medium of claim 22, wherein the machine
readable medium further comprises instructions that in response to being executed
result in the computing device

in response to determining that a number of occurrences of the candidate in
the predecessor blocks is smaller than a number of predecessor blocks and in
response to determining that the candidate is a wait instruction, appending a
compensation code to one or more of the predecessor blocks where the candidate
is absent.

24. The machine readable medium of claim 23, wherein the machine
readable medium further comprises instructions that in response to being executed
result in the computing device 

appending a wait instruction corresponding to the candidate to each of said 
one or more predecessor blocks where the candidate is absent. 

25. The machine readable medium of claim 24, wherein the machine
readable medium further comprises instructions that in response to being executed
result in the computing device

sinking each wait instruction corresponding to the candidate in each
predecessor block of the first block into the first block.

26. The machine readable medium of claim 22, wherein the machine
readable medium further comprises instructions that in response to being executed
result in the computing device

in response to determining that a number of occurrences of the candidate in
the predecessor blocks equals a number of the predecessor blocks, removing
each instruction corresponding to the candidate from each predecessor block of the
first block; and

prepending an instruction instance of the candidate to the first block.

27. The machine readable medium of claim 26, wherein the machine
readable medium further comprises instructions that in response to being executed
result in the computing device

updating a dependence constraint of predecessor blocks of the first block.

28. The machine readable medium of claim 22, wherein the machine
readable medium further comprises instructions that in response to being executed
result in the computing device

determining a sinking candidate from one or more instructions of the
program except the memory access instruction, based on a dependence constraint
of the program;

performing a code sinking on each instruction corresponding to the sinking
candidate subject to the dependence constraint;

determining a hoisting candidate from one or more instructions of the
program except a wait instruction associated with the memory access instruction,
based on the dependence constraint of the program; and

performing a code hoisting on each instruction corresponding to the hoisting
candidate subject to the dependence constraint.

ABSTRACT

Various embodiments that may be used in performing speculative code
motion for memory latency hiding are disclosed. One embodiment comprises
extracting an asynchronous signal from a memory access instruction in a program
to represent a latency of the memory access instruction, and generating a wait
instruction to wait for the asynchronous signal.